Text Mining with n-gram Variables
Similar resources
n-Gram-Based Text Compression
We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It has a significant compression ratio in comparison with those of state-of-the-art methods on the same dataset. Given a text, first, the proposed method splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window with a size that ranges from bi...
N-gram-based Text Attribution
Quantitative authorship attribution refers to the task of identifying the author of a text based on measurable features of the author’s style—a problem that has practical application in areas as diverse as literary scholarship, plagiarism detection, and criminal forensics. Attribution methods generally follow a generative approach, wherein a statistical “profile” is created for a set of candida...
N-Gram-Based Text Categorization
Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual errors, such as spelling and grammatical errors in email, and character recognition errors in documents that come through OCR. Text categorization ...
Improved Text Generation Using N-gram Statistics
In Natural Language Generation (NLG) systems, a general-purpose surface realisation module will usually require the underlying application to provide highly detailed input knowledge about the target sentence. As an attempt to reduce some of this complexity, in this paper we follow a traditional approach to NLG and present a number of experiments involving the use of n-gram language models as an ...
Language Identification of Short Text Segments with N-gram Models
There are many accurate methods for language identification of long text samples, but identification of very short strings still presents a challenge. This paper studies a language identification task, in which the test samples have only 5–21 characters. We compare two distinct methods that are well suited for this task: a naive Bayes classifier based on character n-gram models, and the ranking...
Journal
Journal title: The Stata Journal: Promoting communications on statistics and Stata
Year: 2017
ISSN: 1536-867X,1536-8734
DOI: 10.1177/1536867x1801700406